ABSTRACT



A method for automatically limiting access of a client 
computer to data objects accessed through a server computer 
dynamically prevents robots or webcrawlers from obtaining 
too much of the server database and from dramatically 
reducing server performance. The method includes the steps 
of receiving a request for a data object, recording a log entry 
for the request, calculating client request values, and refus- 
ing the request if a client request value exceeds one of a set 
of corresponding predefined maximum request values. Each 
log entry contains a client identifier, timestamp, and at least 
one data object identifier for the request. The client request 
values preferably include a request frequency, which is 
compared with a predefined maximum request frequency, 
and a cumulative data request, which is compared with a 
data access threshold. If the client is refused access, the 
client identifier is added to a deny list, and future requests 
from the client are automatically denied. The calculated 
cumulative data request may be for a single client, or it may
be for all clients, in order to detect a robot that is divided
among multiple client identifiers. The cumulative data
request check may consider the total percentage of server 
resources being given away, or a pattern in the requests. Also 
provided is a data protection system containing a log file, a 
request analyzer, and a dynamically-generated deny list. 
Requests to the server are intercepted and sent to the data 
protection system first. 

24 Claims, 5 Drawing Sheets 
SYSTEM AND METHOD FOR
DYNAMICALLY LIMITING ROBOT ACCESS 
TO SERVER DATA 

FIELD OF THE INVENTION 

This invention relates generally to methods for limiting 
access of client computers over a computer network to data 
accessed through a server machine. More particularly, it 
relates to methods for monitoring client requests and deny- 
ing access to clients whose requests significantly reduce 
server performance, or who are attempting to obtain exces- 
sively large portions of server resources. 

BACKGROUND ART 

The popularization of the Internet is changing the ways in 
which information is typically distributed. Rather than using 
a limited number of print publications, such as books or 
magazines, or gaining access to libraries, a person can obtain 
a great deal of information by accessing a Web server using 
a browser on a client computer. 

Specialized Web sites exist that share large databases with 
the general public. For example, the U.S. Patent and Trade- 
mark Office (www.uspto.gov) provides a searchable full-text 
patent database containing all U.S. patents issued since 
1976. Similarly, IBM hosts a Java™ Web site 
(www.ibm.com/java) through which developers access tech- 
nical articles and case studies, and download code segments 
and other tools. Gourmet® and Bon Appetit® magazines 
jointly produce the Epicurious® (www.epicurious.com) 
Web site, which contains an enormous recipe database. Each 
of these sites allows users at client browsers to enter 
particular search queries, for example, patent classifications, 
code segment titles, or recipe ingredients. In response, the 
Web server provides the user with a set of matching Web 
pages. Each individual web page result can also be accessed 
directly using its Universal Resource Locator (URL). 

Most Web servers track the number of times their sites are 
accessed, termed "hits"; popular Web sites receive thou- 
sands of hits in a single day. When a request is made to a 
server (a GET message), the request is logged in a log file. 
Log files are not standardized, but generally contain a 
timestamp, an identifier for the client, and a request string.
Web sites can then use the number of hits to attract adver- 
tisers to their site, offsetting their maintenance costs and 
allowing them to continue to provide unlimited and free 
access. 

In addition to individual users, Web servers are also 
accessed heavily by robots, programs that automatically
traverse the Web to create an index. Robots, also known as
spiders or webcrawlers, retrieve a document and then
retrieve all the linked documents contained within the initial
retrieved document, rapidly spreading throughout the Web.
They may also systematically march through every document
on a server. Robots are most commonly, but not
exclusively, used by search engines. One robot (ImageLock)
records every single image it encounters to determine possible
copyright infringers. Robots are not inherently
destructive, but they can cause two significant problems for
a Web server, both of which are referred to as "overcrawling."
First, if they request documents too frequently, they
may significantly reduce a server's performance. Second, it
is possible (although often a violation of copyright law) to 
systematically download an entire Web site information 
repository using a robot, and then publish the information 
elsewhere. 

Currently, these problems are addressed manually. If a
system administrator notices a significant performance
decrease, he or she can examine the log files to determine the
source of the problem. If one robot is causing the problem,
it can be excluded using the Robot Exclusion Standard: the
system administrator creates a structured text file called
/robots.txt that indicates parts of the server that are off-limits
to specific robots. In general, robots read the file before
making a request, and do not request files from which they
are excluded. However, even if a robot does not follow the
standard, it is possible to exclude it if its Internet Protocol
(IP) address is known.
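
As an illustration of the Robot Exclusion Standard described above, the following minimal sketch shows how a well-behaved robot might consult a /robots.txt file before making a request. It uses Python's standard urllib.robotparser module; the robot names (BadBot, GoodBot), the file contents, and the paths are hypothetical and chosen only for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt: one named robot is excluded from the whole
# server, and every other robot is excluded from /private/ only.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt content that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for agent, path in [
    ("BadBot", "/recipes/1.html"),
    ("GoodBot", "/recipes/1.html"),
    ("GoodBot", "/private/notes.html"),
]:
    print(agent, path, parser.can_fetch(agent, path))
```

The check is advisory only: it runs in the robot, so a robot that ignores the file is not stopped by it.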

Manual patrolling of log files is quite time-consuming for
the system administrator, especially as a Web site's hit count
grows. Because it cannot be done in real-time, a crawler is
blocked only after it has slowed down site performance
dramatically, or after it has downloaded significant amounts
of server resources.

A standard method for automatically limiting access to
data is through the use of a firewall. A firewall is a set of
related programs that protect the resources of a private
network by regulating access of outsiders to the network
(and often also by regulating access of insiders to the
Internet). Firewalls may allow outside access only to users
with specific IP addresses or passwords, or may provide
alarms when network security is being breached. However,
they are generally not designed for protecting the resources
of servers that provide information to the general public.

A variety of systems have been developed to monitor
access of clients to server data. Two broad categories are
found: those for clients who have previously registered to
access a server, and who provide an identification that must
be authorized; and systems for analyzing client activity to
develop statistical data and client profiles, which can be used
for marketing or advertising purposes. Both types of monitoring
systems may also include features to determine if
there is excessive traffic that will crash the server. Examples
of the first category include U.S. Pat. No. 5,553,239,
issued to Heath et al., which discloses a system and apparatus
for monitoring a client's activity level during connection
to a server; and U.S. Pat. No. 5,708,780 to Levergood
et al., which provides a system for monitoring the requests
an authorized client makes to a server. These systems cannot
be used to address the current problem, which occurs in
publicly accessible servers.

In the second category is U.S. Pat. No. 5,796,952, issued
to Davis et al. In this system, a client profile is developed
based on client requests and time spent using each requested
file. A server stores information on the amount of data
downloaded and the choices the client has made. Based on
the data analysis, specific advertising can be sent to the
client. This system does not address the problems detailed
above, and is mainly concerned with the user's behavior
after the requested file is sent to the client machine.

Real-time log file analysis is commonly performed; commercial
software packages are available and can be tailored
to suit a Web server's specific needs. These software packages
maintain and analyze log files to create reports of
demographics, purchasing habits, average time per visitor,
and other information. In U.S. Pat. No. 5,787,253 to McCreery
et al., an internet activity analyzer is disclosed. The
analyzer provides source and destination information and
indications of internet usage. It also detects potential server
problems so that users may be notified. A real-time log file
analyzer is also provided by U.S. Pat. No. 5,892,917, issued
to Myerson. This analyzer creates supplemental log records
for cached files that were likely used to satisfy user requests,
in order to create a more accurate profile of user activity.

None of the prior art log file analyzers use the gathered
information to dynamically determine whether crawlers are
abusing their access, either by excessively frequent requests
or by downloading excessive portions of the server database,
and none can dynamically decide to refuse access.

An additional problem, not addressed by the prior art, is
that there is not always a one-to-one correlation between
robots and IP addresses, or other client identifiers. For
example, in many corporations, users access the internet
through a gateway server. All of the users then have the same
IP address, and may appear in a log file as a single user.
Conversely, a robot might deceptively use multiple IP
addresses to systematically download Web site information
without being detected.

There is a need, therefore, for a method for dynamically
limiting robot access to server data as requests are being
made.

OBJECTS AND ADVANTAGES

Accordingly, it is a primary object of the present invention
to provide a system and method for dynamically blocking
access of abusive robots to server resources.

It is an additional object of the invention to provide a
method that dynamically blocks a client from accessing a
server if it has made too many requests.

It is another object of the present invention to provide a
method that dynamically blocks a client from accessing a
server if it is attempting to download a significant portion of
the server's database.

It is a further object of the present invention to determine
whether excessive requests from a single client identifier are
from a gateway server and represent legitimate requests
from multiple users.

It is an additional object to track overcrawling from
different client identifiers that represent one robot.

SUMMARY

These objects and advantages are attained by a method for
limiting access of a client computer to data objects accessible
through a server computer in a distributed computer
network. Preferably, the distributed computer network is the
Internet, and the data object is a Web page. The method is
implemented in the server and automatically recognizes
when a client computer is making requests too frequently or
is accessing too much of the server computer's resources.
The quantitative definitions of "too frequently" and "too
much" are selected by a system administrator or equivalent
to accommodate the needs and limitations of the particular
server. The method can detect three types of clients: a single
client making too frequent requests and accessing too much
of the server resources; a group of clients in a subnet mask,
in which the group requests too frequently for a single client,
but does not access too much of the server data; and a single
entity operating from multiple client computers but accessing
too much of the server resources.

The method has four steps: receiving a request for a data
object from a client computer, recording a log entry for the
request in a log file, calculating client request values associated
with the client identifier from the log entry and from
previous log entries, and refusing to send the requested data
object if at least one of the client request values exceeds one
of a set of corresponding predefined maximum request
values. Preferably, the server sends a refusal message to the
client computer over the distributed computer network when
the request is refused.

The log entry comprises a client identifier, preferably an
IP address, and a timestamp for the request. In one
embodiment, the calculated client request values include a
request frequency for that client, calculated from the current
log entry and from previous log entries associated with the
same client identifier. The set of corresponding predefined
maximum request values include a maximum request
frequency, and the client's request frequency is compared
with the maximum request frequency to determine whether
the client should be refused access. The maximum request
frequency is defined as a number of requests x_1 in a time
period t_1. Preferably, the predefined maximum request values
also include at least one additional maximum request
frequency: x_2 requests in a time period t_2, where x_1 is not
equal to x_2 and t_1 is not equal to t_2. Multiple, independently
selectable maximum request frequencies help detect irregular
patterns the robot may use to escape detection.

In a second embodiment, the log entry also includes at
least one data object identifier, which may be a Universal
Resource Locator (URL) for the data object. Alternately, the
method includes an additional step of processing the request
to generate a result set containing at least one result data
object. In this case, the data object identifier corresponds to
the result data object, and the request must be processed
before the log entry can be completed. In this embodiment,
the client request values include a cumulative data request,
a measure of how much of the server resources the client has
already requested and received in the past. The set of
corresponding predefined maximum request values includes
a data access threshold, the maximum amount or fraction of
data the client may receive. If the client's cumulative data
request exceeds the data access threshold, the client request
is refused. Either embodiment (frequency or data threshold)
may be used separately, or both may be used together, and
the client may be refused access if any one of the client
request values exceeds the corresponding predefined maximum
request values, or only if all of them do.

Alternately, the cumulative data request value may be for
all previous requests, including those with different client
identifiers, not just clients having a single client identifier.
However, only the current request is refused.

The invention also provides a method having additional
steps of comparing the client identifier with a deny list
including denied client identifiers and refusing to send the
requested data object when the client identifier is on the deny
list. If one or all of the client request values exceeds the
corresponding predefined maximum request value, the client
identifier is added to a dynamically-generated deny list. In
an alternate embodiment, if the client identifier is on an
exception list, the client identifier cannot be added to the
deny list, even if the request values are too high.

Finally, the invention provides a data protection system
associated with the server. The system includes a log file
described above, a request analyzer, and a dynamically-generated
deny list. The request analyzer calculates the
request values and compares them with the corresponding
predefined maximum request values to generate failed client
identifiers. Failed client identifiers are added to the deny
list. When the server receives a request from a known
client, it refuses the request if the known client has a client
identifier matching one of the failed client identifiers. In a
preferred embodiment, the system also contains means for
removing a specific failed client identifier from the deny list.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a distributed computer
network incorporating the data protection system of the
present invention.
FIG. 2 is a block diagram of a method of the present
invention.

FIG. 3 is a block diagram of a preferred embodiment of
the frequency check method of the present invention.

FIG. 4 is a block diagram of a preferred embodiment of
the cumulative data check method of the present invention.

FIG. 5 is a schematic diagram of a preferred combination
of the frequency check and data check methods of the
present invention.

FIG. 6 is a block diagram of a data protection system of
the present invention.

DETAILED DESCRIPTION

Although the following detailed description contains
many specifics for the purposes of illustration, anyone of
ordinary skill in the art will appreciate that many variations
and alterations to the following details are within the scope
of the invention. Accordingly, the following embodiments of
the invention are set forth without any loss of generality to,
and without imposing limitations upon, the claimed invention.

FIG. 1 illustrates a distributed computer network 10
incorporating a data protection system 11 of the present
invention. Preferably, network 10 is the Internet, but it may
also be a corporate intranet or other network. Client computers
12, 14, and 16 request and receive data objects, stored
in a database 20, from a Web server 18, which may be any
type of server, running a conventional operating system such
as UNIX or Windows NT. When the present invention is
included, the requests are intercepted by data protection
system 11 before being granted. System 21, including Web
server 18, data protection system 11, and database 20, may
be implemented as a single computer or as several different
computers. As illustrated with client 12, clients may access
Web server 18 directly (through an Internet service
provider). Corporate clients 14 usually obtain access
through a gateway server 22 on a local area network (LAN),
which may limit their access to approved Web sites. While
robots, also known as spiders or webcrawlers, usually run on
a server, a robot 24 may also operate through a set of client
machines 16, which do not appear to server 18 to be
connected with one another. Web server 18 can identify the
source of requests by determining the Internet Protocol (IP)
address from which the request originates. For a robot
operating on a machine 16, server 18 can directly determine
its IP address. For clients 14, however, server 18 can
determine only the IP address of gateway 22. Subnets within
the LAN have distinct addresses, but these are not available
to server 18. In most cases, requests from clients 14 appear
to be from a single source.

As used here, the term "data object" refers to any discrete
piece of data stored on the server or on a different computer
but accessed through the server. The data object may be a file
or a database. For example, search engines have an index
database that they use to generate results matching user
search queries. Preferably, a data object is a Web page or
HTML document, which may have other files, such as image
or audio files, embedded within or linked to the document.
The data object itself can also be an image, audio, or video
file.

Most of the users receiving data objects from server 20 are
legitimate; they are infrequently obtaining a small amount of
data for their own use, and do not intend to republish it. Data
protection system 11 distinguishes them from abusive
robots, who either reduce server performance or "steal"
information to use for another purpose. In general, clients
contain Web browser programs through which users interact
with Web server 18 according to hypertext transfer protocol
(HTTP). In the browser, users enter Universal Resource
Locators (URLs) for desired Web pages, search queries, and
other information. Users can also request pages by clicking
on hyperlinks within a hypertext markup language (HTML)
document. These requests are sent to the Web server in the
form of HTTP GET messages, and the server responds by
sending the user the requested data object. For example, the
U.S. Patent and Trademark Office (PTO) operates a Web
server at www.uspto.gov that provides access to a searchable
database of U.S. patents. Users accessing the database
can request pages or
enter search queries to find patents matching a set of
keywords, classifications, or other data. Client 12 may be
used by an independent inventor who searches the PTO
database at home. Employees of a company may search the
PTO database at clients 14 while at work. However, a robot
16 might be used to systematically download the entire
database.

The present invention provides a method and data protection
system for limiting access of client computers to data
objects accessed through the server computer. Specifically,
the invention prevents overcrawling by robots, also known
as spiders or webcrawlers, by denying access to robots that
make too frequent requests, which significantly reduce
server performance, or that attempt to download systematically
an unacceptably large portion of the server database.
The method and system work by analyzing log files and
dynamically deciding whether to reject current and future
requests from a specific robot. A key feature of the invention
is that it blocks robots without blocking legitimate users. It
can then ensure that the majority of hits a Web site receives
are from legitimate users, and not from automatic robots. A
Web site can use this guarantee to help attract advertisers.
The method can be implemented as a server plug-in that
intercepts requests and only passes them on for server
processing if they satisfy all of the necessary criteria.

A block diagram of the method is shown in FIG. 2. The
method is carried out within data protection system 11 of
FIG. 1 in cooperation with Web server 18 and database 20
and is preferably performed for every request sent to the
server. A server receives a request 26 for a data object from
a client machine over a distributed computer network. In
step 28, a log entry for request 26 is recorded in a log file 30.
Entries in log file 30 preferably contain a client identifier 32
and a timestamp 34, and most preferably also a data object
identifier 36. Client identifier 32 is preferably an IP address,
but may also be a URL. Request values are calculated in step
38 from the log entry for request 26 and previous log entries.
The request values may be only for log entries having a
client identifier matching that of request 26, in which case
the request values are client request values, or they may be
for all log entries. In step 40, a comparison is made between
the calculated request values and a set of corresponding
predefined maximum request values 42. If one of the request
values exceeds a corresponding predefined maximum
request value, the request is refused in step 44. Otherwise,
the requested data object is sent (step 46). Predefined
maximum request values are set by the system administrator
or equivalent and can be changed as needed.

Preferably, a refused client identifier is added to a deny
list. Incoming requests are compared with the deny list
immediately after being logged in the log file, and automatically
denied if the client identifier is on the deny list. Most
preferably, before the client identifier is added to the deny
list, it is compared with an exception list, which contains
client identifiers that have been granted unlimited access and
may not be refused.
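
The flow of FIG. 2, together with the deny list and exception list, might be sketched as follows. This is only an illustrative outline under stated assumptions, not the patented implementation: the class name DataProtectionSystem, the handle method, and the example request value requests_in_last_minute are hypothetical, and the 60-second window and the limit of 2 requests are arbitrary demonstration values.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    client_id: str      # client identifier, e.g. an IP address
    timestamp: float    # time of the request, in seconds
    object_id: str      # data object identifier, e.g. a URL


def requests_in_last_minute(log, entry):
    """Hypothetical request value: this client's requests in the last 60 s."""
    return sum(
        1 for e in log
        if e.client_id == entry.client_id
        and entry.timestamp - 60 < e.timestamp <= entry.timestamp
    )


@dataclass
class DataProtectionSystem:
    calculators: dict   # name -> function(log, entry) returning a request value
    max_values: dict    # name -> predefined maximum request value
    exception_list: set = field(default_factory=set)
    deny_list: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def handle(self, client_id, timestamp, object_id):
        """Return True if the data object may be sent, False if refused."""
        entry = LogEntry(client_id, timestamp, object_id)
        self.log.append(entry)                  # record the log entry
        if client_id in self.deny_list:         # previously refused client
            return False
        exceeded = any(                         # calculate and compare values
            calc(self.log, entry) > self.max_values[name]
            for name, calc in self.calculators.items()
        )
        if exceeded and client_id not in self.exception_list:
            self.deny_list.add(client_id)       # future requests are denied
            return False
        return True


system = DataProtectionSystem(
    calculators={"per_minute": requests_in_last_minute},
    max_values={"per_minute": 2},
    exception_list={"10.0.0.9"},
)
results = [system.handle("192.0.2.1", t, f"/page/{t}") for t in (0, 1, 2, 1000)]
print(results)
```

In this sketch the third request from 192.0.2.1 exceeds the limit, so that request and the much later fourth one are both refused, while a client on the exception list is never added to the deny list.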

There are two main embodiments of the client request 
values and corresponding predefined maximum request val- 
ues 42. In the first main embodiment, the system monitors 
request frequency for a specific client identifier. For a new 
request, the server locates previous log entries having the 
same client identifier, and calculates a request frequency for 
that client. The request frequency is compared with a 
maximum request frequency. Preferably, there are many 
additional maximum request frequencies to which the client 
request frequency is compared, to detect irregular request 
patterns the client may use in an attempt to avoid detection. 
For example, a robot may have periods of inactivity punctuated
by short bursts of high request frequency. The short
bursts may tie up server bandwidth and should be prevented,
but on a long time scale, the request frequency appears
reasonable. Alternately, the robot may make low frequency
requests but, over time, obtain a large portion of the server
resources, in which case it will be caught by the long time
period check. Each maximum request frequency is defined
as a number of requests x_i in a time period t_i.
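
To make the role of several independent limits concrete, the following sketch checks one client's request timestamps against a list of (x_i, t_i) pairs. The function names and the particular limits (5 requests per 10 seconds, 100 requests per hour) are assumptions made for illustration; in practice a system administrator would choose the values.

```python
def requests_within(timestamps, now, period):
    """Count requests made in the half-open window (now - period, now]."""
    return sum(1 for t in timestamps if now - period < t <= now)


def violated_limits(timestamps, now, limits):
    """Return every (x_i, t_i) pair whose maximum x_i is exceeded."""
    return [
        (x, t) for x, t in limits
        if requests_within(timestamps, now, t) > x
    ]


# Hypothetical limits: at most 5 requests per 10 s and 100 per hour.
LIMITS = [(5, 10), (100, 3600)]

# A short burst: 10 requests, one per second.
burst = [1000 + k for k in range(10)]
print(violated_limits(burst, now=1009, limits=LIMITS))

# A slow crawl: one request every 30 s for a full hour (120 requests).
slow = [30 * k for k in range(1, 121)]
print(violated_limits(slow, now=3600, limits=LIMITS))
```

The burst trips only the short-period limit and the slow crawl trips only the long-period limit, which is why a single (x, t) pair would miss one of the two patterns.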

FIG. 3 illustrates a preferred embodiment of the method 
for checking client request frequencies. First, in step 47, the 
GET message and IP address or other client identifier are 
obtained. In step 48, the relevant information is recorded in 
a log file 50. The client identifier is compared with a deny 
list (step 52) to determine whether it has been previously 
refused. If it is on the deny list, the client is automatically 
refused (step 54), and no further calculations are carried out. 

After the initial steps are completed, and if the client
identifier is not on the deny list, the frequency checks are
performed. FIG. 3 shows at least three frequency checks, but
any number may be used, including only one if desired. In
step 56, the number of requests the client identifier has made
within a predefined time period t_1 is determined from the log
file. Time period t_1 may be any time period, from milliseconds
to days, weeks, or even years. This number of requests
is compared with a predefined maximum number, x_1. Values
for t_1 and x_1 are chosen by the system administrator or
equivalent and depend upon the server, database, or Web site
requirements. If the client identifier has more than x_1
requests, the client identifier is added to the deny list in step
58. Before being added, it may be compared with an
exception list in step 60, and only added to the deny list if
it is not in the exception list. After being added to the deny
list, the client is refused access to the requested data object
in step 54.

If the client request values pass first frequency check 56, 
further frequency checks are performed in steps 62 and 64: 
client requests are determined for time periods t_2 through t_n
and compared with x_2 through x_n. The t_i's are preferably all
independently selectable, and the x_i's are preferably all
independently selectable. Preferably, the t_i's and x_i's are
selected so that they provide a sequence of successive 
independent criteria. If the client fails any of the checks, the 
client identifier is added to the deny list (step 58) and the 
request is refused (step 54). If the client identifier passes last 
frequency check 64, the requested data object is sent in step 
66. Of course, an overall pass or fail may consist of any other 
suitable combination of passes and fails for the individual 
frequency checks. 

Alternately, the method illustrated in FIG. 3 can be carried 
out without using the deny list or the exception list. Steps 52, 
58, and 60 are removed, and "yes" responses to checks 56, 
62, and 64 lead directly to refusal, step 54. 

In most cases, the frequency check is not sufficient for
detecting abusive robots. In fact, the frequency check only
detects robots that are quite blatant in their abuse. It also
might reject multiple users connected through a subnet mask
(clients 14 of FIG. 1), and who appear to be from a single
IP address. A more accurate check involves examining the
type and amount of data being sent to a single IP address.
Multiple users with a single IP address might exceed the
maximum frequency, but will make random, uncorrelated
requests and, as a group, will repeatedly request the same
data objects. Robots, on the other hand, make systematic
requests and do not repeat requests. The two can be distinguished
by examining the data objects requested by a single
client identifier. Two possible criteria for distinguishing
robots are the percentage of the database they obtain and the
systematic nature of their requests. For the system to be able
to distinguish between the two, the log file must contain a
data object identifier, so that the cumulative data requested
by the client can be determined.
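
One way to picture the distinction between a gateway and a robot is to compute, from the data object identifiers logged for a single client identifier, how often requests repeat and what share of the database has been obtained. The sketch below is an assumption-laden illustration: the function names, the example URLs, and the database size of 12 objects are hypothetical.

```python
def repeat_fraction(object_ids):
    """Fraction of requests that repeat an already requested data object."""
    if not object_ids:
        return 0.0
    return 1 - len(set(object_ids)) / len(object_ids)


def database_fraction(object_ids, database_size):
    """Fraction of the database obtained, counting each object once."""
    return len(set(object_ids)) / database_size


# Several users behind one gateway tend to request the same popular pages.
gateway = ["/p/1", "/p/2", "/p/1", "/p/1", "/p/3", "/p/2"]
# A robot walks through the database and does not repeat itself.
robot = [f"/p/{i}" for i in range(1, 7)]

print(repeat_fraction(gateway), database_fraction(gateway, 12))
print(repeat_fraction(robot), database_fraction(robot, 12))
```

With the same number of requests, the gateway shows a high repeat fraction and a small share of the database, while the robot shows no repeats and a larger share.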

The cumulative data check is also useful for detecting
crawlers that use multiple client identifiers, as shown in
robot 24 of FIG. 1. A crawler might divide a systematic
procedure among many client machines, each of which
alone does not appear to be obtaining too much of the
database. However, a study of all of the requests, not just
those from a single client identifier, reveals that a large
portion of the database is being sent out, or that a systematic
procedure involving slight changes in requested URLs is
being used. For example, if a client requests data objects 1,
2, and 3, and then the same or a different client requests data
objects 4, 5, and 6, the requests might be from an abusive
robot. In this case, each new IP address is rejected as it is
discovered to be a part of the robot system.
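
The example of data objects 1 through 6 split between two clients can be sketched as a search for runs of consecutively numbered URLs, first per client and then over the pooled log. This is merely one possible heuristic, offered as an assumption; the URL scheme (/patent/N), the helper names, and the IP addresses are hypothetical.

```python
import re


def last_number(url):
    """Return the last integer appearing in a URL, or None if there is none."""
    numbers = re.findall(r"\d+", url)
    return int(numbers[-1]) if numbers else None


def longest_consecutive_run(urls):
    """Length of the longest run n, n+1, n+2, ... among the URL numbers."""
    numbers = {n for n in map(last_number, urls) if n is not None}
    best = 0
    for n in numbers:
        if n - 1 not in numbers:          # n starts a run
            length = 1
            while n + length in numbers:
                length += 1
            best = max(best, length)
    return best


# Pooled log of (client identifier, requested URL) pairs.
log = [
    ("10.0.0.1", "/patent/1"), ("10.0.0.1", "/patent/2"), ("10.0.0.1", "/patent/3"),
    ("10.0.0.2", "/patent/4"), ("10.0.0.2", "/patent/5"), ("10.0.0.2", "/patent/6"),
]
per_client = {
    client: longest_consecutive_run([u for c, u in log if c == client])
    for client in {c for c, _ in log}
}
pooled = longest_consecutive_run([u for _, u in log])
print(per_client, pooled)
```

Each client alone shows a run of only three objects, but the pooled log shows a run of six, which is the kind of pattern that examining all requests together is meant to reveal.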

FIG. 4 illustrates a preferred embodiment of the check for
amount of data being sent out. For the method of FIG. 4, the
request values that are calculated include a cumulative data
request, and the corresponding predetermined maximum
values include a data access threshold.
In the first step of FIG. 4, a request and client identifier 70
are obtained. Two types of request are possible. One is a
simple GET message for a specific URL within the Web
server. Alternately, the request may be a search query; in the
example above for the PTO database, the query may be a
keyword for a patent keyword search. An entry containing a
client identifier and timestamp for the request is recorded in
a log file 74 in step 72. Next, in step 76, the client identifier
is compared with a deny list. If the client identifier is in the
deny list, the request is refused, step 78. If the client
identifier is not in the deny list, the request is processed in
step 80; i.e., a search is performed for the query to generate
a result set 82 containing result data objects. If the request
does not require a search to be performed, then step 80 is
unnecessary. In this case, result set 82 contains one or more
requested URLs.

Step 84 begins a series of steps that are carried out for
each data object in the result set. In step 86, a cumulative
data request is determined for the client identifier or,
alternately, for all requests logged. The cumulative data
request identifies all of the previous data objects that this
client has obtained. Alternately, the cumulative data request
identifies all of the data objects that have been obtained by
all clients through this server. After calculating the cumulative
data request, the system determines whether the extra
data object requested will increase the cumulative data
request over the predefined data access threshold. If not, a
data object identifier for the element is added to the log entry
for the current request for this client identifier in step 88, and
the system proceeds to the next element. Preferably, the data
object identifier is a URL.
If the cumulative data request does exceed the threshold
when the current data object is included, the system then
checks, in step 90, whether or not the current data object is
already included in the log file, either for the client identifier
or at all, depending on which type of check is being
performed. If the data object is already in the log file, then
the current request is probably not from an abusive robot;
sending it to the client is not giving away too much data.
However, if the current element has not been given away
previously, then the client will be refused. In step 92, the
exception list is checked before the client identifier is added
to the deny list in step 94 and refused in step 78.

If the data object passes the tests of steps 86 and 90, the
system checks for more data objects in the result set in step
96, and then moves to the next element in step 84. If all of
the elements pass, then the results are sent in step 98.
Preferably, if one of the elements fails, the client identifier
is assumed to deserve blocking, and none of the result data
objects will be sent. Of course, other criteria may be chosen.
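
The per-object loop of steps 84 through 98 might be sketched as below for the single-client form of the check. Several choices here are assumptions rather than anything the specification fixes: the cumulative data request is modeled as a simple count of objects delivered to the client (repeats included), `delivered` is a hypothetical in-memory stand-in for the log file, and a client on the exception list is simply allowed through.

```python
from typing import Dict, List, Optional, Set


def check_result_set(client_id: str,
                     result_ids: List[str],
                     delivered: Dict[str, List[str]],
                     threshold: int,
                     deny_list: Set[str],
                     exception_list: Set[str]) -> Optional[List[str]]:
    """Return the objects to send (step 98), or None if refused (step 78)."""
    history = delivered.setdefault(client_id, [])
    accepted: List[str] = []
    for obj in result_ids:                              # step 84
        total = len(history) + len(accepted) + 1        # step 86
        is_new = obj not in history and obj not in accepted
        if total > threshold and is_new:                # step 90
            if client_id not in exception_list:         # step 92
                deny_list.add(client_id)                # step 94
                return None                             # nothing is sent
        accepted.append(obj)                            # step 88
    history.extend(accepted)                            # commit to the log
    return accepted
```

Because the history is only extended after every element has passed, a refused request leaves the log unchanged, matching the preference that none of the result data objects be sent when one element fails.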

Additional steps may be added between steps 96 and 98.
The system may need to process the request further, or it
may need to supply other information along with the result
data objects. For example, the system might create a results
page containing a list of and hyperlinks to the search results,
which it only does after determining that the request should
be granted. In addition, only some of the server resources
may be protected, while other resources are freely granted.

As shown in FIG. 4, steps 86 and 90 are used to determine
the fraction of the database that has been requested and
received by a client. Alternate embodiments of these steps
may also be used. For example, a robot may be systematically
marching through every file in the server database by
slightly altering a filename with every request.
The current request can be compared with previous requests
to determine if there is a pattern.
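
One way such a pattern comparison could work is sketched below. It is only one possible heuristic, not the method of the specification: it assumes the "slight alteration" is a number embedded in the filename that increases by one per request, and the names `split_numeric`, `looks_systematic`, and the default `min_run` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

_DIGITS = re.compile(r"\d+")


def split_numeric(name: str) -> Optional[Tuple[str, int]]:
    """Split a name into (template, number) around its last run of digits."""
    matches = list(_DIGITS.finditer(name))
    if not matches:
        return None
    last = matches[-1]
    template = name[:last.start()] + "{}" + name[last.end():]
    return template, int(last.group())


def looks_systematic(requests: List[str], min_run: int = 4) -> bool:
    """True if the last min_run requests share one template and their
    embedded numbers increase by exactly one each time."""
    if len(requests) < min_run:
        return False
    parts = [split_numeric(r) for r in requests[-min_run:]]
    if any(p is None for p in parts):
        return False
    templates = {p[0] for p in parts}
    numbers = [p[1] for p in parts]
    steps_by_one = all(b - a == 1 for a, b in zip(numbers, numbers[1:]))
    return len(templates) == 1 and steps_by_one
```

A real robot could of course vary its step size or ordering, so a deployed check would likely need a broader family of patterns than this single case.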

As with the frequency checks, the method for checking
cumulative data request can be performed without steps 76,
92, and 94, i.e., without the deny list and exception list. A
"no" response to step 90 leads directly to refusal, step 78.

The two main embodiments, the frequency check and
cumulative data check, may be used together or separately.
If they are used together, the request values include both a
request frequency and a cumulative data request, and the
corresponding predefined maximum values include both a
maximum request frequency and a data access threshold.
The client identifier fails and is rejected if one of the request
values exceeds the corresponding predefined maximum
request value. Alternately, the client identifier fails if all of
the request values exceed the predefined maxima. Of course,
any other number, between one and all, may be chosen.
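
The "one, all, or any number in between" rule can be written as a single comparison. The sketch below is illustrative only; the dictionary keys and the function name `client_fails` are assumptions, and `required` is the hypothetical number of request values that must exceed their maxima before the client identifier fails.

```python
from typing import Dict


def client_fails(values: Dict[str, float],
                 maxima: Dict[str, float],
                 required: int = 1) -> bool:
    """True if at least `required` request values exceed their maxima."""
    exceeded = sum(1 for name, value in values.items()
                   if value > maxima[name])
    return exceeded >= required


values = {"request_frequency": 12.0, "cumulative_data": 40.0}
maxima = {"request_frequency": 10.0, "cumulative_data": 100.0}

fails_if_any = client_fails(values, maxima, required=1)
fails_if_all = client_fails(values, maxima, required=len(values))
```

With these example numbers only the request frequency is exceeded, so the client fails under the "one of" rule and passes under the "all of" rule.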

Preferably, for the combined embodiment, the frequency
check is performed first, followed by the cumulative data
request check. Instead of sending the requested data object
if the client identifier passes the frequency checks, as is done
in step 66 of FIG. 3, the system passes the request to the
cumulative data check, beginning with step 80 of FIG. 4.
Alternately, the second test may be used depending on the
outcome of the first test: only if it is failed or only if it is
passed. For example, there might be a frequency above
which a cumulative data check is not needed; the frequency
is so high and slows down the server so dramatically that
blocking the crawler is justified.

FIG. 5 shows the flow of the preferred embodiment of a
combined test, for the different types of clients. First, a
frequency check 100 is performed, followed by a data
threshold check 102 or 104. Depending on the combined
outcomes, the request is either refused (106 or 108) or sent
(110). User examples are as follows:

1. Overcrawling robot: single client with single IP 
address. Fails frequency check 100 and fails data check 
104: refused (106). 

2. Legitimate users: multiple clients with single IP 
address. Fails frequency check 100 but passes data 
check 104: data sent (110). 

3. Overcrawling robot: single client with multiple IP 
addresses. Passes frequency check 100 but fails data 
check 102: refused (108). 

4. Legitimate users: single client with single IP address 
(also multiple clients with multiple IP addresses). 
Passes frequency check 100 and passes data check 102: 
data sent (110). 
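
The four user examples can be summarized as a small decision function. This is a reading of the listed examples only, not of FIG. 5 itself: it assumes the frequency result selects which data check is run (102 after a pass, 104 after a fail) and that the data check result then decides the outcome. The function name is hypothetical; the integers are the reference numerals from the text.

```python
from typing import Tuple


def combined_outcome(passes_frequency: bool,
                     passes_data: bool) -> Tuple[int, int]:
    """Return (data check numeral, outcome numeral) for one request."""
    data_check = 102 if passes_frequency else 104
    if passes_data:
        return data_check, 110           # data sent
    refusal = 108 if passes_frequency else 106
    return data_check, refusal           # refused


for freq_ok in (False, True):
    for data_ok in (False, True):
        print(freq_ok, data_ok, combined_outcome(freq_ok, data_ok))
```

Under this reading, legitimate users who share one IP address (example 2) are not blocked merely for a high request frequency, because the data check still passes.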

A preferred embodiment of a data protection system 112 
used to implement the method is displayed schematically in 
FIG. 6. A client identifier and request directed to a Web 
server 114 are intercepted and sent to a request analyzer 116. 
The request analyzer may be an additional computer, or it 
may be implemented as a Web server plug-in. The request is 
for a data object stored within database resources 115, which 
may be in server 114 or on a different computer. Request 
analyzer 116 may have individual analyzer pieces, shown in 
FIG. 6 as a request frequency analyzer 118, a cumulative 
data analyzer 120, and an exception analyzer 122. Request 
analyzer 116 interacts with a log file 124, a set of predefined 
maximum request values 126, a dynamically-generated deny
list 128, and means for removing a specific failed client 
identifier from deny list 128, in this case an exception list 
130. 

The data protection system of FIG. 6 operates as follows. 
When a request from a known client is received by request 
analyzer 116, a log entry is made in log file 124 for the 
request. As discussed above, the log entry contains a client 
identifier and time stamp for the request, and preferably also 
a data object identifier for the requested data object. Next, 
request analyzer 116 checks whether the client identifier for 
the new request is in deny list 128. If it is, it sends a "fail"
message to server 114, and server 114 refuses to send the 
requested data object. 

If the client identifier is not in deny list 128, request 
frequency analyzer 118 calculates the client request fre- 
quency from all of the relevant entries in log file 124. It then 
compares this request value with a corresponding element of 
set 126, the maximum request frequency. If the client request 
frequency exceeds the maximum request frequency, the 
client identifier is considered a failed client identifier, and is 
added to deny list 128. 
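
A minimal sketch of what request frequency analyzer 118 might compute is given below. It assumes the log is a sequence of (client identifier, timestamp) pairs and that the maximum request frequency is expressed as a hypothetical `max_requests` within `window_seconds`; neither name nor the windowing choice is taken from the specification.

```python
from typing import Iterable, Set, Tuple


def request_count(log: Iterable[Tuple[str, float]],
                  client_id: str,
                  now: float,
                  window_seconds: float) -> int:
    """Count this client's logged requests in (now - window, now]."""
    start = now - window_seconds
    return sum(1 for cid, stamp in log
               if cid == client_id and start < stamp <= now)


def frequency_check(log: Iterable[Tuple[str, float]],
                    client_id: str,
                    now: float,
                    max_requests: int,
                    window_seconds: float,
                    deny_list: Set[str]) -> bool:
    """Return True if the client passes; otherwise add it to the deny list."""
    count = request_count(log, client_id, now, window_seconds)
    if count > max_requests:
        deny_list.add(client_id)   # failed client identifier
        return False
    return True
```

Scanning the whole log on every request is the simplest approach; a production system would more likely keep a per-client queue of recent timestamps.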

Next, cumulative data analyzer 120 adds a data object 
identifier to the current log entry in file 124. Depending on
the type of request, cumulative data analyzer 120 may need
to search database resources 115 to determine which data 
objects the client has requested. After a search result set is 
obtained, a data object identifier for a result data object is 
added to the current log entry in log file 124. Alternately, the 
request may be for a URL, and cumulative data analyzer 120 
simply adds the URL to the log entry. 

Based on the requested data objects, cumulative data 
analyzer 120 calculates the cumulative data request, either 
from previous log entries for the current client identifier, or 
from all previous log entries. It then compares the cumula- 
tive data request with a data access threshold in set 126. If 
the cumulative data request exceeds the data access 
threshold, the client identifier is a failed client identifier, and
is added to deny list 128. Depending on the decision of the
system administrator, a client identifier that fails the frequency
check but passes the cumulative data check may be
removed from deny list 128. As explained above, the cumulative
data check may be for a particular client identifier, or
it may be for all data given to all clients, in order to detect
a robot using multiple client identifiers. Request analyzer
116 may include either or both of request frequency analyzer
118 and cumulative data analyzer 120.
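
The difference between the per-client and the all-clients form of the cumulative data check can be illustrated as follows. The sketch assumes the cumulative data request is measured as the fraction of distinct database objects handed out, and that the log is a list of (client identifier, object identifier) pairs; `cumulative_fraction` and the example identifiers are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def cumulative_fraction(log: Iterable[Tuple[str, str]],
                        database_size: int,
                        client_id: Optional[str] = None) -> float:
    """Fraction of distinct objects given to one client, or to all
    clients when client_id is None."""
    objects = {obj for cid, obj in log
               if client_id is None or cid == client_id}
    return len(objects) / database_size


log = [("ip-a", "d1"), ("ip-a", "d2"),
       ("ip-b", "d3"), ("ip-b", "d4"), ("ip-b", "d1")]

per_client = cumulative_fraction(log, 10, "ip-a")   # 2 of 10 objects
all_clients = cumulative_fraction(log, 10)          # 4 of 10 objects
```

A robot split across ip-a and ip-b stays under a per-client threshold of, say, 0.35 in this example, while the all-clients figure exceeds it, which is why the second form of the check is useful for detecting a robot using multiple client identifiers.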

If the request passes both the request frequency check and the
cumulative data check, request analyzer 116 sends a message
to server 114 to process the request and send the results
to the client. If the request has failed one or both of the
checks, depending on the configuration set up by the network
administrator, the failed client identifier is sent to
exception analyzer 122. If the failed client identifier is also
on exception list 130, exception analyzer 122 removes the
newly-added failed client identifier from deny list 128.
Alternately, exception analyzer 122 compares the failed
client identifier with exception list 130 before it is added to
deny list 128. If the client identifier is on the exception list,
request analyzer 116 sends a message to server 114 that the
request may be granted.
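
The two orderings described for exception analyzer 122 can be sketched side by side. Both functions are hypothetical illustrations that model deny list 128 and exception list 130 as in-memory sets; each returns True when the request may be granted.

```python
from typing import Set


def resolve_after_deny(client_id: str,
                       deny_list: Set[str],
                       exception_list: Set[str]) -> bool:
    """First ordering: the failed client is already on the deny list
    and is removed again if it is on the exception list."""
    if client_id in exception_list:
        deny_list.discard(client_id)
        return True
    return False


def resolve_before_deny(client_id: str,
                        deny_list: Set[str],
                        exception_list: Set[str]) -> bool:
    """Second ordering: the exception list is consulted first, and the
    failed client is added to the deny list only if it is not on it."""
    if client_id in exception_list:
        return True
    deny_list.add(client_id)
    return False
```

Both orderings leave the deny list in the same final state; the second avoids a brief interval in which an excepted client appears on the deny list.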

If the request fails, and if the client identifier is not on the
exception list, then request analyzer 116 sends a message to
server 114 that the request must be refused. Server 114 may
send a message to the client that the request was refused,
with no further information. Preferably, the message informs
the client why the request was refused, stating that the client
has made too many requests, that the client has obtained too
much of the server database, or that the client's request
indicates that it is a part of a multi-client system of obtaining
too much of the database.

Most preferably, the message also states that if the user
has a legitimate purpose for the apparent overcrawling, he or
she may request to register with the Web site and be added
to the exception list. Continuing the example of the PTO
Web site, an intellectual property firm might have employees
constantly sending requests and downloading patents from
the database. At some point, it is likely that this firm will
exceed either one of the maximum request frequencies or the
data access threshold. However, the firm is not publishing
the patents elsewhere. Upon receiving the rejection message,
the firm can prove its legitimacy and register with the Web
site. It will then be added to the exception list and will be
able to make unlimited requests. The registration may expire
at specific time intervals, for example, 6 months or one year,
at which point the user must reregister.
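
Expiring registrations could be modeled by storing an expiry time with each exception list entry. The sketch below is an assumption about one possible bookkeeping scheme; `register`, `is_excepted`, and the six-month constant are hypothetical names and values chosen for illustration.

```python
from typing import Dict

SIX_MONTHS = 182 * 24 * 3600.0  # assumed registration lifetime in seconds


def register(exceptions: Dict[str, float],
             client_id: str,
             now: float,
             lifetime: float = SIX_MONTHS) -> None:
    """Add or renew a registration that expires after `lifetime`."""
    exceptions[client_id] = now + lifetime


def is_excepted(exceptions: Dict[str, float],
                client_id: str,
                now: float) -> bool:
    """True while the registration is valid; expired entries are dropped."""
    expiry = exceptions.get(client_id)
    if expiry is None:
        return False
    if now >= expiry:
        del exceptions[client_id]   # the user must reregister
        return False
    return True
```

Once an entry has expired, the client is subject to the frequency and cumulative data checks again until it reregisters.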

Of course, as robots learn of the present invention, they
will devise methods for circumventing the frequency checks
and data checks. Just as computer virus detection software
is updated periodically as new viruses are developed, the
present invention may also be updated periodically to keep
up with new techniques robots develop.

It will be clear to one skilled in the art that the above
embodiment may be altered in many ways without departing
from the scope of the invention. For example, the type of
cumulative data check used may depend on the results of the
frequency checks. Accordingly, the scope of the invention
should be determined by the following claims and their legal
equivalents.

What is claimed is: 

1. A method for protecting a web server from abusive 
clients, the method comprising: 

a) receiving at said webserver client requests from said
abusive clients comprising originating IP addresses and
requested data objects;

b) recording log entries associated with the client requests
in a request log file;

c) denying at said webserver client requests from an IP 
address failing a frequency check, wherein the fre- 
quency check comprises determining from the log file 
if a request frequency from the IP address exceeds a 
predetermined maximum request frequency; 

d) denying at said webserver client requests from an IP 
address failing a cumulative data check, wherein the 
cumulative data check comprises determining from the 
log file if a cumulative data request value for the IP 
address exceeds a predetermined data access threshold; 
and 

e) denying at said webserver client requests from multiple 
IP addresses failing a second cumulative data check, 
wherein the second cumulative data check comprises 
identifying systematic request patterns, 

wherein said abusive clients are webcrawlers sending 
said client requests over the Internet to said web- 
server. 

2. In a server computer in a distributed computer network, 
a method for protecting data objects accessible on said 
server computer from excessive access by an abusive client 
computer, said method comprising the steps of: 

a) receiving at said server computer a request from said 
client computer for one of said data objects accessible 
on said server computer; 

b) recording a log entry associated with said request in a 
log file, wherein said log entry comprises a client 
identifier, at least one data object identifier, and a 
timestamp for said request, wherein said client identi- 
fier is an IP address; 

c) calculating client request values associated with said 
client identifier from said log entry and from previous 
log entries associated with said client identifier, 
wherein said client request values comprise a request 
frequency and a cumulative data request value; 

d) refusing to send from the server computer said 
requested data object to said client computer if at least 
one of said client request values exceeds one of a set of 
corresponding predefined maximum request values, 
wherein said predefined maximum request values com- 
prise a maximum request frequency and a data access 
threshold; and 

e) calculating a second cumulative data request value 
from said log entry and from previous log entries 
associated with all client identifiers, and refusing to 
send from the server computer said requested data 
object to said client computer if the second cumulative 
data request value exceeds a second data access 
threshold, 

wherein said distributed computer network is the Inter- 
net and said client computer is a webcrawler. 

3. The method of claim 1 further comprising sending a 
refusal message to said client computer over said distributed 
computer network if said requested data object is refused. 

4. The method of claim 1 wherein said maximum request
frequency comprises a number of requests x₁ in a time
period t₁.

5. The method of claim 4 wherein said predefined maximum
request values further comprise at least one additional
maximum request frequency, wherein said additional maximum
request frequency comprises a number of requests x₂
in a time period t₂.

6. The method of claim 2 wherein said requested data
object is a Web page and said data object identifier comprises
a Universal Resource Locator for said Web page.

7. The method of claim 2 further comprising processing
said request to generate a result set comprising at least one
result data object, wherein said data object identifier corresponds
to said result data object.

8. The method of claim 2 wherein said request is refused 
if all of said client request values exceed said predefined 
maximum request values. 

9. The method of claim 2 further comprising a step of 
adding said client identifier to a deny list if one of said client 
request values exceeds one of said predefined maximum 
request values. 

10. In a server computer in a distributed computer 
network, a method for protecting data objects accessible on 
said server computer from excessive access by an abusive 
client computer, said method comprising the steps of: 

a) receiving at the server computer a request for one of 
said data objects from said client computer; 

b) recording a log entry associated with said request in a 
log file, wherein said log entry comprises a client 
identifier and a timestamp for said request; 

c) comparing said client identifier with a deny list com- 
prising denied client identifiers; 

d) refusing to send from said server computer said 
requested data object to said client computer if said 
client identifier is on said deny list; 

e) if said client identifier is not on said deny list, calcu- 
lating client request values associated with said client 
identifier from said log entry and from previous log 
entries associated with said client identifier, wherein 
said client request values comprise a request frequency 
and a cumulative data request value; and 

f) adding said client identifier to said deny list if at least 
one of said client request values exceeds one of a set of 
corresponding predefined maximum request values, 
wherein said predefined maximum request values com- 
prise a maximum request frequency and a data access 
threshold, 

wherein said distributed computer network is the Inter- 
net and said client computer is a webcrawler. 

11. The method of claim 10 wherein said maximum
request frequency comprises a number of requests x₁ in a
time period t₁.

12. The method of claim 11 wherein said predefined
maximum request values further comprise at least one
additional maximum request frequency, wherein said additional
maximum request frequency comprises a number of
requests x₂ in a time period t₂.

13. The method of claim 10 wherein said requested data 
object is a Web page and said data object identifier com- 
prises a Universal Resource Locator for said Web page. 

14. The method of claim 10 further comprising processing 
said request to generate a result set comprising at least one 
result data object, wherein said data object identifier corre- 
sponds to said result data object. 

15. The method of claim 10 further comprising adding
said client identifier to said deny list if all of said client
request values exceed said predefined maximum request
values.

16. The method of claim 10 wherein said client identifier
is an IP address.

17. The method of claim 10 further comprising adding
said client identifier to said deny list if one of said client
request values exceeds one of said predefined maximum
request values and said client identifier is not on an exception
list comprising allowed client identifiers.

18. In a distributed computer network comprising a server
computer and client computers, wherein said client computers
request data objects from said server computer, a data
protection system associated with said server computer, said
data protection system comprising:

a) a log file comprising log entries, wherein each of said
log entries corresponds to a client request from one of
said client computers for one of said data objects from
said server computer and comprises a client identifier,
a data object identifier, and a timestamp;

b) a request analyzer associated with said server computer
for calculating client request values from said log
entries, wherein said client request values comprise a
client request frequency and a cumulative data request
value, and for comparing said client request values with
a set of corresponding predefined maximum request
values comprising a maximum request frequency and a
data access threshold, to generate failed client identifiers;
and

c) a dynamically-generated deny list comprising said
failed client identifiers, wherein said request analyzer
compares a new request from a known client computer
with said deny list and refuses said new request if said
known client computer has a client identifier matching
one of said failed client identifiers,

wherein said computer network is the Internet and said
client computer is a webcrawler.

19. The data protection system of claim 18, further 
comprising means for removing a specific failed client 
identifier from said deny list. 

20. The data protection system of claim 18 wherein said
maximum request frequency comprises a number of requests
x₁ in a time period t₁.

21. The data protection system of claim 20 wherein said
predetermined criteria further comprise at least one additional
maximum request frequency, wherein said additional
maximum request frequency comprises a number of requests
x₂ in a time period t₂.

22. The data protection system of claim 18 wherein said
requested data object is a Web page and said data object
identifier comprises a Universal Resource Locator for said
Web page.

23. The data protection system of claim 18 wherein said
client request is processed to generate a result set comprising
at least one result data object, and wherein said data object
identifier corresponds to said result data object.

24. The data protection system of claim 18 wherein said 
client identifier is an IP address. 

* * * * *